Characterizing genetic transmission networks among newly diagnosed HIV-1 infected individuals in eastern China: 2012–2016

We aimed to elucidate the characteristics of HIV molecular epidemiology and identify transmission hubs in eastern China using genetic transmission network and lineage analyses. HIV-TRACE was used to infer putative relationships. Across the range of epidemiologically-plausible genetic distance (GD) thresholds (0.1–2.0%), a sensitivity analysis was performed to determine the optimal threshold, generating the maximum number of transmission clusters and providing reliable resolution without merging different small clusters into a single large cluster. Characteristics of genetically linked individuals were analyzed using logistic regression. Assortativity (shared characteristics) analysis was performed to infer shared attributes between putative partners. 1,993 persons living with HIV-1 were enrolled. The determined GD thresholds within subtypes CRF07_BC, CRF01_AE, and B were 0.5%, 1.2%, and 1.7%, respectively, and 826 of 1,993 (41.4%) sequences were linked with at least one other sequence, forming 188 transmission clusters of 2–80 sequences. Clustering rates for the main subtypes CRF01_AE, CRF07_BC, and B were 50.9% (523/1027), 34.2% (256/749), and 32.1% (25/78), respectively. Median cluster sizes of these subtypes were 2 (2–52, n = 523), 2 (2–80, n = 256), and 3 (2–6, n = 25), respectively. Subtypes in individuals diagnosed and residing in Hangzhou city (OR = 1.423, 95% CI: 1.168–1.734) and men who have sex with men (MSM) were more likely to cluster. Assortativity analysis revealed individuals were more likely to be genetically linked to individuals from the same age group (AIage = 0.090, P<0.001) and the same area of residency in Zhejiang (AIcity = 0.078, P<0.001). Additionally, students living with HIV were more likely to be linked with students than show a random distribution (AI student = 0.740, P<0.01). 
These results highlight the importance of Hangzhou City in the regional epidemic and show that MSM comprise the population rapidly transmitting HIV in Zhejiang Province. We also provide a molecular epidemiology framework for improving our understanding of HIV transmission dynamics in eastern China.


Sample collection
Blood samples (5 mL) were collected by staff of the local Center for Disease Control and Prevention (CDC) for the routine measurement of CD4 cell counts prior to antiviral treatment. This study included individuals newly diagnosed with HIV infection between 2012 and 2016 in Zhejiang province for whom more than 0.2 mL of blood remained after CD4 cell counting. All eligible remaining blood samples were collected by the Center of Zhejiang HIV/AIDS confirmative laboratory. Epidemiological information was also collected by local CDC staff. Most information, including age, occupation, transmission route, marital status, education background, residence at diagnosis, and current residence, had been reported via the Chinese HIV/AIDS Comprehensive Response Information Management System (CRIMS).

HIV-1 pol gene sequence analysis
A fragment of the HIV-1 pol gene (HXB2 positions 2147-3462) was sequenced using DNA samples extracted from blood plasma. The sequences were assembled using Sequencher v5.0 (Gene Codes Corporation, Ann Arbor, MI, USA), aligned against all HIV-1 group M reference sequences available in the Los Alamos HIV Sequence Database (www.hiv.lanl.gov/content/sequence.html), and subsequently edited manually using BioEdit v7.2.0. The subtype classification tool COntext-based Modeling for Expeditious Typing (COMET) was used to subtype all sequences [17].

Sensitivity analysis
HIV-TRACE (www.hiv-trace.org) has been used to infer transmission networks based on HIV subtype B [18]. However, more sensitive and specific methods are needed to identify recent transmission clusters and avoid spurious detection of subtype CRF01_AE and CRF07_BC clusters. The number of transmission clusters and the cluster size are the key determinants of the sensitivity and specificity of molecular cluster inference. HIV pol sequences tend not to diverge more than 0.01 substitutions/site from the baseline sequence in the first 10 years of infection [19], and the total sequence divergence tends to be less than 2.0%. Therefore, we explored the effect of using either conservative or liberal distance thresholds ranging between 0.1% and 2.0% [20]. To determine the optimal genetic distance (GD) threshold, pairwise GDs were computed, and the number of transmission clusters and the largest cluster size were calculated at each threshold for the main subtypes (CRF01_AE, CRF07_BC, and B).
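The threshold sweep described above can be illustrated with a minimal sketch. It assumes pairwise distances are already available as a dictionary, treats clusters as connected components of the graph whose links fall below the threshold, and breaks ties in the number of clusters by preferring the smaller largest cluster. The function names (`cluster_sizes`, `sweep`) and the toy distances are hypothetical and are not taken from HIV-TRACE or from the study data.

```python
from collections import Counter


def cluster_sizes(distances, threshold):
    """Sizes of connected components formed by links with distance < threshold.

    Unlinked sequences never enter the union-find structure, so every
    returned size is at least 2.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d < threshold:
            parent[find(a)] = find(b)

    sizes = Counter(find(x) for x in list(parent))
    return sorted(sizes.values(), reverse=True)


def sweep(distances, thresholds):
    """For each threshold: (threshold, number of clusters, largest cluster size)."""
    rows = []
    for t in thresholds:
        sizes = cluster_sizes(distances, t)
        rows.append((t, len(sizes), sizes[0] if sizes else 0))
    return rows


# Hypothetical toy distances in substitutions/site (not study data).
toy = {
    ("s1", "s2"): 0.004,
    ("s3", "s4"): 0.004,
    ("s2", "s3"): 0.012,
    ("s5", "s6"): 0.016,
    ("s1", "s5"): 0.030,
}
thresholds = [i / 1000 for i in range(1, 21)]  # 0.1% to 2.0% in 0.1% steps
table = sweep(toy, thresholds)

# Assumed selection rule: most clusters, then the smallest largest cluster.
best = max(table, key=lambda row: (row[1], -row[2]))
```

In this toy example, raising the threshold past 1.2% merges the two small clusters into one larger cluster, which is the behavior the sensitivity analysis is designed to avoid.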
HIV-TRACE was used to construct transmission clusters. HIV-1 pol sequences generated from each individual were used to infer the transmission network. All sequences were aligned to the HXB2 (GenBank accession: K03455) reference sequence to correct for possible frameshifts and sequencing errors. A putative link between two individuals was inferred whenever the distance between their sequences (Tamura-Nei 93 model) was below the GD threshold [21]. The evolutionary conservation in this region permitted pairwise alignment, and the Tamura-Nei 93 (TN93) model is the most complex evolutionary model for which pairwise distances can be computed rapidly via a closed-form solution [19][20][21]. When calculating GDs between sequences, we resolved all IUPAC-defined nucleotide ambiguities (i.e., non-ATCG characters) to the corresponding nucleotide in the other sequence (e.g., Y is zero distance from both T and C). A phylogenetic test of conditional independence was used on each triangle in the network to remove spurious transitive connections. Moreover, 37 codons associated with major resistance in protease and reverse transcriptase were stripped from the alignment [19].
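The ambiguity rule above (Y is zero distance from both T and C) can be sketched as follows. For simplicity this example computes an uncorrected p-distance (proportion of mismatching sites) rather than the TN93 distance used in the study, and it is not HIV-TRACE code. The function name `p_distance` and the choice to skip gap characters are assumptions made for illustration.

```python
# Standard IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ between two aligned sequences.

    An ambiguous base counts as a match whenever it is compatible with the
    base in the other sequence. Sites with gaps or unknown symbols are
    skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in IUPAC or y not in IUPAC:
            continue
        compared += 1
        if not set(IUPAC[x]) & set(IUPAC[y]):
            mismatches += 1
    return mismatches / compared if compared else float("nan")


# Y (C or T) is compatible with both T and C, but not with A.
d_t = p_distance("ACGT", "ACGY")
d_c = p_distance("ACGC", "ACGY")
d_a = p_distance("ACGA", "ACGY")
```

A pair of sequences would then be linked when its distance falls below the subtype-specific GD threshold.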

Assortativity analysis
The assortativity index (AI) was calculated from the district mixing matrix, a matrix composed of the proportions of relationships between clustering individuals. The AI ranges from -1.0 to 1.0, with AI > 0 indicating that clustered individuals are more likely to be linked with individuals from the same category, AI < 0 indicating that clustered individuals are more likely to be linked with individuals from a different category, and AI = 0 indicating that the relationship between clustered individuals is not influenced by category. We computed Newman's assortativity index to describe the mixing patterns in our dataset using the R package igraph [22,23].
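As an illustration of how such an index is obtained from a mixing matrix, the sketch below implements Newman's assortativity coefficient for a categorical attribute on an undirected network, r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²), where e is the normalized mixing matrix and a_i are its row sums. It is a plain Python re-derivation and not the authors' igraph code. The function name and toy networks are hypothetical.

```python
def assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical node attribute.

    edges    : iterable of (u, v) pairs in an undirected network
    category : dict mapping each node to its category label
    """
    edges = list(edges)
    cats = sorted({category[n] for e in edges for n in e})
    index = {c: i for i, c in enumerate(cats)}
    k = len(cats)

    # Mixing matrix of edge-end counts; each undirected edge is counted
    # once in each direction, so the matrix is symmetric.
    m = [[0] * k for _ in range(k)]
    for u, v in edges:
        i, j = index[category[u]], index[category[v]]
        m[i][j] += 1
        m[j][i] += 1

    total = 2 * len(edges)
    trace = sum(m[i][i] for i in range(k))
    row_sq = sum(sum(row) ** 2 for row in m)

    # Same formula as above, multiplied through by total**2 so that the
    # arithmetic stays in integers until the final division.
    denominator = total * total - row_sq
    if denominator == 0:
        return float("nan")  # only one category present
    return (trace * total - row_sq) / denominator


# Hypothetical four-node examples (not study data).
labels = {1: "student", 2: "student", 3: "other", 4: "other"}
r_same = assortativity([(1, 2), (3, 4)], labels)    # links only within category
r_cross = assortativity([(1, 3), (2, 4)], labels)   # links only across categories
r_mixed = assortativity([(1, 2), (3, 4), (1, 3), (2, 4)], labels)
```

The three toy networks reproduce the interpretation given in the text: fully within-category linkage, fully cross-category linkage, and linkage unrelated to category.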

Statistical analysis
Statistical analyses were performed using Statistical Product and Service Solutions (SPSS) v19.0 (IBM, Armonk, NY, USA). Non-parametric comparisons were made using Pearson's χ² tests or Fisher's exact tests. P-values < 0.05 were considered significant. Characteristics of clustered and non-clustered individuals in the transmission network were compared using logistic regression analysis. Odds ratios (ORs) and 95% confidence intervals were reported to show the direction and strength of associations. Student's t-tests were performed to compare the means of independent groups.
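For a single binary characteristic, the OR from a univariate logistic regression equals the cross-product ratio of the 2×2 table, and a Wald-type 95% confidence interval can be computed on the log scale. The sketch below shows that arithmetic. The function name is hypothetical, and the example reuses the clustering counts given in the abstract for CRF01_AE (523/1027) and CRF07_BC (256/749) purely as an illustration. It is an unadjusted comparison and not a result reported by the study, whose ORs come from its own regression models.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: group 1, clustered      b: group 1, not clustered
    c: group 2, clustered      d: group 2, not clustered
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustration only: CRF01_AE (523 of 1027 clustered) versus
# CRF07_BC (256 of 749 clustered), counts taken from the abstract.
or_ae, lo_ae, hi_ae = odds_ratio_ci(523, 1027 - 523, 256, 749 - 256)
```

A multivariable model would additionally adjust this estimate for the other covariates, which a 2×2 table cannot do.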

Ethical approval and informed consent
This study was part of a mandated routine analysis of demographic surveillance data reported to the CRIMS. Individuals were assigned identification numbers unique to the study and could be linked back to the original data by authorized personnel only. The raw data did not contain any personally identifying information linked to particular individuals and were anonymized before use. This study and its protocols were approved by the Ethical Review Committee of the Chinese Center for Disease Control and Prevention (X140617334). The requirement for informed consent was waived by the ethics committee. All procedures were carried out in accordance with approved guidelines and regulations.
Eleven pairs of spouses were identified among the clusters. Each individual appeared in the same cluster as his or her spouse, and 63.6% (14/22) of them formed dyads. Individuals over 50 years of age made up 68.2% (15/22) of the spousal pairs. The infection route for all 11 women was heterosexual transmission. Nine of the eleven men were infected via heterosexual contact, one was infected via sex with other men, and one was infected via injection drug use.

Large size clusters
The largest clusters in subtypes CRF07_BC, CRF01_AE, and B contained 80, 52, and 6 individuals, respectively. In these clusters, the proportions of MSM were 80% (64/80), 82.6% (43/52), and 100% (6/6), respectively, which were higher than in other clusters; 44.9% (62/138) of the individuals in these clusters were diagnosed in Hangzhou city. No significant differences in the region of diagnosis, marital status, STI history, HIV infection status, education background, or place of residence were observed among the clusters. Three female cases were included in the largest CRF07_BC cluster, and all of them were infected via heterosexual transmission. All of their spouses were HIV-1-positive; however, no sequences were obtained from their spouses.

Characteristics of genetically linked individuals
Population characteristics were compared between clustered and non-clustered individuals (Table 1). In univariate analysis, the following characteristics were significantly associated with clustering: age between 18 and 35 years at the time of HIV diagnosis, unmarried status, infection with subtype CRF01_AE, MSM, male sex, diagnosis and residence in Hangzhou city, and prior HIV testing history (P<0.01). In multivariate analysis, the following characteristics were significantly associated with clustering: MSM and infection with subtype CRF01_AE (P<0.01).

Assortativity analysis between genetically linked individuals
To evaluate epidemic characteristics between genetically linked individuals (putative transmission pairs), we computed the assortativity index based on individual demographics. Linkage was assortative by area of current residence, student status, risk group, age group, marital status, and year of diagnosis, but not by gender or STI history (Fig 4). Individuals were more likely to be genetically linked to individuals of the same age group (AI age = 0.090, P<0.001), from the same area of residence in Zhejiang (AI city = 0.078, P<0.001), diagnosed in the same year (AI year = 0.140, P<0.001), in the same risk group (AI risk = 0.185, P<0.001), and with the same marital status (AI marital = 0.076, P<0.001). Students were more likely to be genetically linked with students (AI student = 0.740, P<0.001). Clustering between individuals was not assortative by gender (AI gender = 0.020, P = 0.199) or by STI history (AI sti = -0.008, P = 0.589).
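The text does not spell out how the P-values for the AI were obtained. One plausible reading, assumed here, is a comparison of the observed AI with a random distribution built by shuffling the category labels across nodes 1,000 times while keeping the network fixed. The sketch below implements that label-permutation test with a one-sided P-value. The function names, the seed, and the toy network are hypothetical.

```python
import random


def assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical attribute."""
    cats = sorted({category[n] for e in edges for n in e})
    index = {c: i for i, c in enumerate(cats)}
    m = [[0] * len(cats) for _ in cats]
    for u, v in edges:
        i, j = index[category[u]], index[category[v]]
        m[i][j] += 1
        m[j][i] += 1
    total = 2 * len(edges)
    trace = sum(m[i][i] for i in range(len(cats)))
    row_sq = sum(sum(row) ** 2 for row in m)
    denominator = total * total - row_sq
    if denominator == 0:
        return float("nan")
    return (trace * total - row_sq) / denominator


def permutation_test(edges, category, n_perm=1000, seed=0):
    """Observed AI and one-sided P-value under random label shuffling."""
    rng = random.Random(seed)
    observed = assortativity(edges, category)
    nodes = sorted({n for e in edges for n in e})
    labels = [category[n] for n in nodes]
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if assortativity(edges, dict(zip(nodes, shuffled))) >= observed:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical network: a triangle of students and a chain of non-students.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (6, 7), (7, 8)]
status = {1: "student", 2: "student", 3: "student",
          4: "other", 5: "other", 6: "other", 7: "other", 8: "other"}
ai, p_value = permutation_test(edges, status)
```

Because the labels are shuffled rather than the edges, the null distribution preserves both the network structure and the number of individuals in each category.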

Discussion
We analyzed the transmission networks of HIV-1 infections among 1,993 individuals newly diagnosed in Zhejiang province between 2012 and 2016 and determined plausible GD thresholds to identify potential transmission partners. Based on our results, we arrived at three conclusions regarding the dynamics of the HIV-1 epidemic. First, the GD threshold within CRF07_BC was substantially lower than that within CRF01_AE and B in Zhejiang. Second, Hangzhou city, the capital of Zhejiang, was a hub for the HIV-1 transmission network in Zhejiang and was therefore the most important region for targeted intervention; MSM and Hangzhou city were risk factors for clustering. Third, students were more likely to be linked with other students than expected under a random distribution.
Using sensitivity analysis, we determined that the GD threshold of the CRF07_BC subtype was much lower than that of other subtypes in Zhejiang. This result can be attributed to the dense sampling in the population and the faster rate of spread. CRF01_AE and CRF07_BC are major subtypes in Asia, and most recent studies have used phylogenetic analysis to investigate HIV transmission clusters in China [24,25]. However, phylogenetics is insufficient for epidemiological analysis because it cannot identify recent infection events, and individuals in a cluster are treated as equally connected. Additionally, high support for a clade does not indicate that its members are necessarily closely related to each other [26]. Here, we applied GD thresholds to identify transmission clusters and recent infection events. This method is thought to be sufficient for epidemiological purposes [27]. Lower distance thresholds (e.g., 0.5%) might be more appropriate for distinguishing rapidly growing clusters or populations where more rapid evolution (non-B subtypes) predominates [28]. This low GD may represent highly related and rapidly expanding transmission networks in Zhejiang, and such information could be important for public health control efforts. A GD threshold of 0.5% for CRF07_BC could serve as a useful proxy for epidemiological relatedness in a surveillance setting in eastern China. Subtype B is the main circulating subtype in the US and Western Europe, and the GD threshold of pol sequences ranges from 1.0% to 2.0% [28][29][30]. Although the proportion of the B subtype in China is low, we found that it showed a plausible GD with an acceptable cluster size. Thus, we recommend using a genetic threshold of 1.7% for the B subtype in China to identify clusters to balance sensitivity and specificity.
[Figure legend fragment: Assortativity analysis by age group, student status, city of residence, marital status, and year of diagnosis. The random value is based on 1,000 repetitions; "1" indicates linked individuals originating from the same category, whereas "-1" indicates those from different categories. Assortativity indices were AI age = 0.090, AI city = 0.078, AI marital = 0.076, AI student = 0.740, AI year = 0.140, and AI risk = 0.185.]
Hangzhou city, located in northern Zhejiang province, is the provincial center for economy, culture, education, and tourism. This city accounts for 48.3% of newly diagnosed HIV-1 infections among MSM in Zhejiang province. Among MSM infections, up to 73.1% of edges connected a node from Hangzhou to nodes from other regions in Zhejiang province, implying potential transmission between Hangzhou and other regions and suggesting that Hangzhou plays an important role in the HIV epidemic in Zhejiang province [10]. The workbook method estimated that 69.6% of MSM living with HIV/AIDS in Zhejiang resided in urban areas [31]. Urban areas contain more convenient sexual venues and larger inflow populations, and are more attractive to MSM [32]. Thus, as a metropolitan area, Hangzhou city is dominant in terms of both the risk of clustering in the transmission network (Table 1) and possession of the largest cluster. Current laws and regulations, as well as social stigma, force MSM to hide their sexual orientation and behaviors in both their origin and destination residences [4]. Other studies have shown that Chinese MSM are likely to hide their sexual identity and engage in sexual behaviors in areas other than their hometowns to avoid social stigma [33]. Thus, these findings are crucial for developing targeted interventions for MSM in Hangzhou, and such interventions are expected to affect the entire province.
We found that linkage was assortative by city of residence, student status, risk group, age group, marital status, and year of diagnosis. The AI value for student status was very high, suggesting direct or indirect transmission between students. A questionnaire survey of college students in Qingdao showed that 42.7% of students preferred college students as sexual partners, and 86.0% of students looked for sexual partners through the internet [34]. In China, there was a certain degree of spatial and temporal clustering among young MSM students aged 18 to 24 with reported HIV infection [35]. Targeted prevention and control measures should be carried out in these hot spots and clustering areas to reduce the prevalence of HIV among students. In this study, 92.4% of students were young MSM. Young MSM were more likely to find sexual partners in bars and on the internet, whereas middle-aged men preferred public baths and parks. These differences may be related to the participants' cultural and recreational environments [36]. MSM within the same age group who engaged in non-internet-based intercourse easily formed closer relationships with strong ties [37], suggesting that different categories of MSM might have relatively separate networks. Alternatively, these differences could be driven by a lack of accessible same-age partners in the social settings in which they are primarily found [38]. China's most popular gay app has grown to include 15 million users in only 2 years [39]. Among social app users, 36.4% met their last partner within 24 hours of the first message exchange [39]. Thus, we can infer that many MSM students find their sexual partners within a short time and in close proximity. School sites acted as centers of sexual activity, and some students had sexual encounters with school classmates.
In a cross-sectional study determining sexual behavior among 535 college students in Zhejiang Province, 88.9% of MSM students did not insist on using condoms in the previous year [40]. Education regarding potential health issues associated with unprotected sex might influence condom use and prevent the spread of diseases; therefore, it is necessary to strengthen public knowledge of HIV infection among MSM students in Zhejiang province. Moreover, social network platforms can be used by local AIDS publicity campaigns to ensure dissemination of effective health education.
[18], similar to that in Zhejiang province. Moreover, another investigation among populations of MSM in Shenzhen city in 2012 showed that CRF01_AE and CRF07_BC accounted for 32.3% and 43.2% of cases, respectively [10], in contrast to the results observed in Zhejiang province. Therefore, infection transmission by MSM may be subject to geographical constraints.
Univariate comparison between clustering and non-clustering nodes revealed that linked individuals were significantly more likely to be between 18 and 35 years of age, unmarried, infected with subtype CRF01_AE, MSM, male, diagnosed and residing in Hangzhou city, and to have a prior HIV testing history. The multivariable model revealed that clustering individuals were more commonly MSM and infected with subtype CRF01_AE. The other factors (being unmarried, 18-35 years old, or male, being diagnosed and residing in Hangzhou city, and having a prior HIV testing history) did not remain significant in the multivariable model. The reason may be that the independent variable MSM is associated with being male, unmarried, and 18-35 years old. A multivariable model can adjust for the influence of confounding factors, so its results are often more reliable.
This study has some limitations. First, it was subject to sampling bias, as individuals were enrolled by convenience sampling without random selection. However, the data included 1,993 sequences reflecting HIV infection in Zhejiang. Second, some data, including drug use and STI history, were self-reported, and incorrect reporting of this information might have introduced bias into our model. Third, the individuals in this study were newly diagnosed between 2012 and 2016, and it was not possible to conduct a more complete retrospective survey of individuals within transmission clusters. Moreover, these data could not be used to verify the direct or indirect transmission relationships among individuals within transmission clusters. However, the samples obtained from newly reported MSM students in Zhejiang province in this study accounted for the majority of such cases; therefore, the results were considered representative.
In conclusion, this study determined plausible GD thresholds for subtypes CRF07_BC, CRF01_AE, and B for genetic transmission network analysis; these thresholds could serve as references in eastern China. Furthermore, we established a preliminary understanding of transmission in Zhejiang province and identified the characteristics of clustering individuals, highlighting the role of Hangzhou in the transmission network. We found that transmission links were more likely between individuals from the same area of residence. Our findings highlight the importance of transmission network analysis for gaining a better understanding of regional transmission patterns. Further studies of molecular transmission networks should focus on individual transmission relationships and on individuals with a high number of edges (high degree) in the transmission network. Combined with epidemiological information, such analyses can determine the high-risk factors for the population in the transmission network and further guide intervention measures against HIV transmission.
Supporting information
S1 File. Sequences in this study and reference sequences used in phylogenetic analyses. Fourteen reference sequences (sequences 1-14) and 717 sequences of CRF07_BC in this study are listed together.